ABSTRACT

The following problem is considered: given a linked list of
length n, compute the distance from each element of the linked
list to the end of the list. The problem has two standard
deterministic algorithms: a linear time serial algorithm, and an
O(log n) time parallel algorithm using n processors. We
present new deterministic parallel algorithms for the problem.
Our strongest results are:

1. O(log n log^(k) n) time using n/(log n log^(k) n) processors, for
any fixed positive integer k, where log^(k) is the k-th iterate of
the log function. This algorithm achieves optimal speed-up.

2. O(log n log* n) time using n/log n processors. Since log* n
grows extremely slowly as a function of n, this algorithm
achieves optimal speed-up for all practical purposes.

3. O(log n) time using n log^(k) n/log n processors, for any fixed
positive integer k.

The  algorithms  apply  a  novel  "random-like"  deterministic 
technique.  This  technique  provides  for  a  fast  and  efficient 
breaking  of  a  symmetric  situation  in  parallel. 

1.   Introduction 

The  model  of  parallel  computation  used  in  this  paper  is  the  exclusive- 
read  exclusive-write  (EREW)  parallel  random  access  machine  (PRAM).  A 
PRAM  employs  p  synchronous  processors  all  having  access  to  a  common 
memory.  An  EREW  PRAM  does  not  allow  simultaneous  access  by  more 
than  one  processor  to  the  same  memory  location  for  read  or  write  purposes. 
See  [Vi-83a]  for  a  survey  of  results  concerning  PRAMs. 

Let Seq(n) be the fastest known worst-case running time of a sequential
algorithm, where n is the length of the input for the problem being
considered. Obviously, the best upper bound on the parallel time achievable
using p processors, without improving the sequential result, is of the form
O(Seq(n)/p). A parallel algorithm that achieves this running time is said to
have optimal speed-up or more simply to be optimal.

We  present  a  new  deterministic  coin  tossing  technique  for  devising 
parallel  algorithms.  The  technique  uses  the  binary  representation  of  names 
(numbers)  for  breaking  a  symmetric  situation  in  a  "random-like"  fashion. 

Let m be the size of the memory of our computer. Our technique
performs well when each variable in the underlying model of computation is
represented by a few bits (say O(log m) bits). Interestingly, the technique
performs badly when each variable is represented by many bits (say f(m)
bits, where f is the inverse of log*). Representing each variable by O(log m)
bits is in line with typical definitions of RAMs (see [AHU-74]). The role of
PRAMs is to extend the RAM model to express parallelism. This extension
should have no effect on the number of values that each variable may
assume. A variant of PRAMs (called PRAM-INFINITY) that allows each
variable to assume infinitely many values has been proposed recently. The
PRAM-INFINITY also allows infinitely large shared memory. This variant
(or closely related ones) was used to prove lower bounds for various
interesting problems; the proofs apply mathematically appealing
"Ramsey-like" theorems (see [FMRW-85], [IM-85], [MW-85]).

It appears that in the transition from PRAM to PRAM-INFINITY we
lose the coin tossing technique. For the technique depends crucially on the
fact that each variable is represented by few bits (say O(log m) bits), while in
the PRAM-INFINITY model this constraint does not exist; in fact, there is
no restriction on the number of bits representing a variable. This is
analogous to the loss of bucket sort when we adopt the decision tree model.
(See [AHU-74] for both an Ω(n log n) time lower bound for sorting n
elements in a decision tree model and an O(n) time bucket sort algorithm.)

We  show  how  to  apply  our  coin  tossing  technique  to  the  list-ranking 
problem  defined  below. 

Input:  A  linked  list  of  length  n.  It  is  given  in  an  array  of  length  n,  not 
necessarily  in  the  order  of  the  linked  list.  Each  of  the  n  elements  (except  the 
last  element  in  the  linked  list)  has  the  array  index  of  its  successor  in  the 
linked list.

The  problem:  For  each  element,  compute  the  number  of  elements  following 
it  in  the  linked  list. 

The  list  ranking  problem  is  encountered  often  in  the  design  of  parallel 
algorithms.  For  instance,  the  fundamental  "Euler  tour  technique"  for 
computing  various  tree  functions  (see  [TV-83]  and  [Vi-85])  has  the  same 
efficiency  as  the  new  algorithm  presented  here. 

The problem has a trivial linear time serial algorithm and a simple
deterministic parallel algorithm. This parallel algorithm runs in time
O(log n) using n processors. However, Wyllie [W-79] conjectured that Ω(n)
processors are required in order to get O(log n) time. If true, this would
imply, in particular, that there is no optimal speed-up parallel algorithm for
n/log n processors. [KRS-85] recently presented an optimal speed-up
algorithm for this problem that runs in O(n^ε) time using n^(1-ε) processors, for
fixed ε, 1 > ε > 0. [Vi-84b] proposed to use randomized parallel algorithms
for this problem. A randomized parallel algorithm which runs in O(n/p)
time using p ≤ n/(log n log* n) processors on an EREW PRAM was given.
The probability that this will indeed be the running time converges rapidly to
one as n grows. In particular, this optimal speed-up algorithm runs in
"about" O(log n) time using "about" n/log n processors.

In  this  paper  we  present  new  deterministic  parallel  algorithms.  Our 
strongest  results  are: 

1. O(log n log^(k) n) time using n/(log n log^(k) n) processors, for any fixed
positive integer k, where log^(k) is the k-th iterate of the log function. This
algorithm achieves optimal speed-up.

2. O(log n log* n) time using n/log n processors. Recall that log* n grows
extremely slowly and can be viewed as a constant for all practical purposes.
(For instance, log* 2^65536 = 5. See the function G in [AHU-74], p. 133.)
Therefore, we can justifiably say that our algorithm achieves optimal
speed-up for all practical purposes.

3. O(log n) time using n log^(k) n/log n processors, for any fixed positive
integer k, thereby showing that Wyllie's conjecture is incorrect.
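
The iterated logarithm log^(k) and the function log* that appear in these bounds can be made concrete with a short Python sketch. It is illustrative only: the helper names ceil_log2, iterated_log and log_star are ours, and ceilings are taken at every step, whereas the paper assumes for simplicity that the logarithms are integers.

```python
def ceil_log2(n: int) -> int:
    """Ceiling of the base-2 logarithm of a positive integer n."""
    return (n - 1).bit_length()


def iterated_log(n: int, k: int) -> int:
    """log^(k) n: apply the (ceiling) base-2 logarithm k times, stopping at 1."""
    for _ in range(k):
        if n <= 1:
            break
        n = ceil_log2(n)
    return n


def log_star(n: int) -> int:
    """Number of times the logarithm must be applied before the value is <= 1."""
    count = 0
    while n > 1:
        n = ceil_log2(n)
        count += 1
    return count
```

Under these conventions log_star(2**65536) evaluates to 5, which agrees with the value quoted in result 2.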

The next section presents the new deterministic coin tossing technique
for breaking a symmetric situation. Among other things, Section 3 reviews
an optimal speed-up deterministic parallel algorithm that uses balanced trees.
The algorithm is used later for two purposes: (1) as a subroutine, and (2) to
explain the new list ranking algorithm. The new algorithm essentially grafts
the new technique onto the framework of the balanced tree algorithm. In
Section 4 we describe the basic version of our algorithm that runs in time
O(log n loglog n) using n log* n/(log n loglog n) processors. This algorithm
achieves close to optimal speed-up; it will be quite adequate for all practical
purposes. In Section 5 we describe the algorithms that achieve optimal
speed-up and our other results.

2.   The deterministic coin tossing technique

We  illustrate  the  deterministic  coin  tossing  technique  by  using  it  to  break 
the  symmetric  situation  that  arises  in  the  following  problem. 
Input: A connected directed graph G(V,E). The in-degree of each vertex is
exactly one. The out-degree of each vertex is exactly one. Such a graph is
called a ring since it forms a directed circuit. Let n = |V|.
We define a subset U of V to be an r-ruling set of G if:

(1)  No  two  vertices  of  U  are  adjacent. 

(2)  For  each  vertex  v  in  V  there  is  a  directed  path  from  v  to  some 
vertex  in  U  whose  edge  length  is  at  most  r. 

The  r-ruling  set  problem:  Find  an  r-ruling  set  of  V. 

In order to demonstrate our basic technique we give an O(1) time
algorithm using n processors for the ⌈log n⌉-ruling set problem. The
algorithm is given for the EREW PRAM.

Later, we present a recursive application of the technique. It leads to an
O(k) time algorithm using n processors for the ⌈log^(k) n⌉-ruling set problem.
In particular, it provides an O(log* n) time algorithm using n processors for
the 2-ruling set problem.

Assumptions about the input representation: The vertices are given in an
array of length n. The entries of the array are numbered from 0 to n-1. The
numbers are represented as binary strings of length ⌈log n⌉. We refer to each
binary symbol (bit) of this representation by a number between 0 and
⌈log n⌉ - 1. The rightmost (least significant) bit is called bit number 0 and
the leftmost bit is called bit number ⌈log n⌉ - 1. Each vertex has a pointer to
the next vertex in the ring (representing its outgoing edge). For simplicity
we assume that log n is an integer.^1

Here is a verbal description of an algorithm for the log n-ruling set
problem. The algorithm is given later. Processor i, 0 ≤ i ≤ n-1, is assigned
to entry i of the input array (for simplicity, entry i is called vertex i). It will
attach the number i to vertex i. So, the present "serial" number of vertex i,
denoted SERIAL_0(i), is i. Next, we attach to vertex i a new serial number,
denoted SERIAL_1(i), as follows. Let i_1 be the vertex preceding i and let i_2 be
the vertex following i. (That is, (i_1,i) and (i,i_2) are in E.) Let j be "the
number of the rightmost bit in which i and i_2 differ". Processor i assigns j to
SERIAL_1(i).

Example. Let i be ...010101 and i_2 be ...111101. The number of the
rightmost bit in which i and i_2 differ is 3 (recall the rightmost bit has number
0). Therefore, SERIAL_1(i) is 3.

Remark (Due to B. Schieber). j can be computed by a constant number of
standard operations, as follows. Without loss of generality suppose i > i_2
(otherwise interchange the two numbers). Set h = i - i_2, and k = h - 1. (So
h has a 1 for bit number j, and a 0 for bits of lesser significance, while k has
a 0 for bit number j, and a 1 for bits of lesser significance; also, h and k
agree on the bits of higher significance.) Compute l = h ⊕ k, where ⊕ is the
exclusive-or operation. We observe l is the unary representation of j+1. So
it just remains to convert this value from unary to binary, and then to
subtract one.
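
The remark above can be sketched in a few lines of Python. The function name lowest_differing_bit is ours, and the final unary-to-binary conversion is done here with Python's int.bit_length, which is an implementation convenience rather than the PRAM primitive the remark has in mind.

```python
def lowest_differing_bit(i: int, i2: int) -> int:
    """Number of the rightmost bit in which i and i2 differ (bit 0 is rightmost)."""
    if i == i2:
        raise ValueError("the two numbers must be distinct")
    if i < i2:               # without loss of generality i > i2
        i, i2 = i2, i
    h = i - i2               # has a 1 at bit j and 0s below it
    k = h - 1                # has a 0 at bit j and 1s below it
    l = h ^ k                # j+1 low-order ones: the unary representation of j+1
    return l.bit_length() - 1  # unary to binary, then subtract one
```

For the example in the text, lowest_differing_bit(0b010101, 0b111101) returns 3.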

Next, we show how to use the information in vector SERIAL_1 in order to find
a log n-ruling set.

Fact 1: For all i, SERIAL_1(i) is a number between 0 and log n - 1 and needs
only ⌈loglog n⌉ bits for its representation. For simplicity we will assume that
loglog n is an integer.

Fact 2: Suppose SERIAL_1(i) is a local minimum. (That is,
SERIAL_1(i) ≤ SERIAL_1(i_1) and SERIAL_1(i) ≤ SERIAL_1(i_2), where i_1 is the
vertex preceding i, and i_2 is the vertex following i.) Then:


^1 The base of all logarithms in the paper is 2.

(a) At least one of SERIAL_1(i), SERIAL_1(i_2), and SERIAL_1 of the
log n - 2 vertices immediately further beyond i_2 is a local maximum.

(b) Let k be such a local maximum that is closest to i. Then, at least
one of SERIAL_1(k) and SERIAL_1 of the log n - 1 vertices immediately
further beyond it is a local minimum.

(c) The length of the shortest path from any vertex in G to a (vertex
which provides a) local extremum (minimum or maximum), with
respect to SERIAL_1, is at most log n - 1.

Observe that several local minima (or maxima) may form a "chain" of
successive vertices in G. Requirement (1), in the definition of an r-ruling
set, does not allow us to include all these local minima in the set of selected
vertices. Our algorithm exploits the alternation property (defined below) of
vector SERIAL_1 to overcome this problem.

The alternation property: Let i be a vertex and j be its successor. If bit
number SERIAL_1(i) of SERIAL_0(i) is 0 (resp. 1), then this bit is 1 (resp. 0) in
SERIAL_0(j).

Suppose that i_1, i_2, ... is a chain in G such that SERIAL_1(i) is a local
minimum (resp. maximum) for every i in the chain. Then:

Fact 3: For all vertices in the chain SERIAL_1 is the same (i.e.,
SERIAL_1(i_1) = SERIAL_1(i_2) = ...). (By definition of local minimum.)

Below, we consider bit number SERIAL_1(i_1) of SERIAL_0 for all vertices in the
chain. Let i_l, i_(l+1) be two adjacent vertices in the chain.

Fact 4: Bit number SERIAL_1(i_1) of SERIAL_0(i_l) is not equal to bit number
SERIAL_1(i_1) of SERIAL_0(i_(l+1)). (This is readily implied by the alternation
property.)

We select all vertices i that are local minima and satisfy one of the following
two conditions:

(1) Neither of i's neighbors (the vertices adjacent to i) is a local
minimum.

(2) Bit number SERIAL_1(i) of SERIAL_0(i) is 1.

We say an unselected vertex is available if neither of its neighbors was
selected and it is a local maximum. We select all available vertices i that
satisfy one of the following two properties.

(1) Neither of i's neighbors is available.

(2) Bit number SERIAL_1(i) of SERIAL_0(i) is 1.

The selected vertices form a log n-ruling set. Requirement (1) is satisfied
since we never select two adjacent vertices. Requirement (2) is satisfied by
Fact 2(c) and since every local extremum is either selected or is a neighbor of
a vertex that was selected.

Less informally we write the algorithm as follows. (Later, we will refer
to this as the basic step.)

for Processor i, 0 ≤ i ≤ n-1, pardo
    SERIAL_0(i) := i
    SERIAL_1(i) := "the minimal bit in which SERIAL_0(i) differs from
                    SERIAL_0 of the following vertex"
    if SERIAL_1(i) is a local minimum with respect to the two neighbors of i
    then if either of the following is satisfied:
            (1) neither of the vertices adjacent to i is a local minimum
            (2) bit number SERIAL_1(i) of SERIAL_0(i) is 1
         then select i
    if neither i nor any of its neighbors were selected and if SERIAL_1(i) is
       a local maximum with respect to the two neighbors of i
    then (** i is available, and **) if either of the following is satisfied:
            (1) neither of the vertices adjacent to i is available
            (2) bit number SERIAL_1(i) of SERIAL_0(i) is 1
         then select i
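
The basic step can be simulated sequentially to check its behaviour on small rings. The following Python sketch is ours and rests on stated assumptions: the ring is given as a successor array succ over vertices 0..n-1 with n ≥ 2, local minima and maxima are taken in the non-strict sense (so that chains of equal values are possible), and each parallel phase is modelled by computing a whole list before the next phase reads it. The names ring_from_order and basic_step are hypothetical.

```python
def ring_from_order(order):
    """Successor array of the ring that visits the vertices in the given order."""
    succ = [0] * len(order)
    for pos, v in enumerate(order):
        succ[v] = order[(pos + 1) % len(order)]
    return succ


def basic_step(succ):
    """Return a list of booleans marking the vertices selected by the basic step."""
    n = len(succ)
    pred = [0] * n
    for v, s in enumerate(succ):
        pred[s] = v

    def lowest_differing_bit(a, b):
        x = a ^ b
        return (x & -x).bit_length() - 1

    # SERIAL_0(v) = v; SERIAL_1(v) = lowest bit in which v and its successor differ.
    serial1 = [lowest_differing_bit(v, succ[v]) for v in range(n)]
    # Bit number SERIAL_1(v) of SERIAL_0(v).
    bit = [(v >> serial1[v]) & 1 for v in range(n)]
    nbrs = [(pred[v], succ[v]) for v in range(n)]

    is_min = [all(serial1[v] <= serial1[u] for u in nbrs[v]) for v in range(n)]
    is_max = [all(serial1[v] >= serial1[u] for u in nbrs[v]) for v in range(n)]

    # Phase 1: local minima that are isolated or whose distinguishing bit is 1.
    first = [is_min[v] and (bit[v] == 1 or not any(is_min[u] for u in nbrs[v]))
             for v in range(n)]
    # Phase 2: available local maxima, chosen by the same rule.
    avail = [is_max[v] and not first[v] and not any(first[u] for u in nbrs[v])
             for v in range(n)]
    second = [avail[v] and (bit[v] == 1 or not any(avail[u] for u in nbrs[v]))
              for v in range(n)]
    return [a or b for a, b in zip(first, second)]
```

A typical check is that no two adjacent vertices are selected, for instance all(not (sel[v] and sel[succ[v]]) for v in range(len(succ))) with sel = basic_step(succ).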

Below, we show how to apply the basic step repeatedly in order to find a
2-ruling set.

The k-th application of the basic step.

In order to prepare the input for the k-th application of the basic step,
we "delete" from G the vertices that were selected in the previous k-1
applications, their neighbors, and the edges incident to any vertex being
deleted.

The input for the k-th application of the basic step is the remaining graph
and vector SERIAL_(k-1). SERIAL_(k-1) will play the role played above by SERIAL_0
and a new vector SERIAL_k will play the role of SERIAL_1. The degree of each
vertex in the input graph is at most 2 (if the directions of the edges are
ignored). It will be very simple to extend the basic step to handle vertices
whose degree is ≤ 1. Vertices whose degree is 2 are treated as in the basic
step (unless they have a neighbor whose degree is 1). The k-th application of
the basic step will be as follows. (For an explanation see Fact 5 below.)

for processor i, 0 ≤ i ≤ n-1, pardo
    if vertex i or one of its neighbors has been selected
       in a previous application of the basic step
    then "delete" vertex i and the edges incident to it
for processor i, 0 ≤ i ≤ n-1, such that i is in the remaining graph pardo
    case 1  deg(i) = 2
        then compute SERIAL_k(i)
             if the degree of each of i's two neighbors is 2
             then apply the basic step to i
    case 2  deg(i) = 0
        then select i
    case 3  deg(i) = 1
        then if either of the following is satisfied
                (1) the degree of i's neighbor is 2
                (2) i's neighbor is its successor
             then select i

The following fact helps to clarify the operation of the k-th application of the
basic step.

Fact 5: Let i, j be adjacent in the input graph for the k-th application. Then:

SERIAL_(k-1)(i) ≠ SERIAL_(k-1)(j). (For k = 1 this inequality clearly holds.
We show that it also holds if k > 1. If they were equal each of them
had to be a local maximum or local minimum at the (k-1)-st
application. The selection of the ruling set implies that each local
maximum or local minimum is either selected or has a neighbor being
selected. Therefore, it must have been deleted and cannot be included
in this input graph.)

Fact 6: It is easy to deduce that the output graph consists of simple paths
each of length at most loglog ... log n - 1 (counting edges) where the
sequence includes k "log"s. (Again, we assume for simplicity that each
application of a sequence of logs to n produces only integers.)

We finish this description with three obvious conclusions.

(1) After a total of log* n applications we delete all vertices in the
graph.

(2) The vertices that were selected form a 2-ruling set.

(3) The cardinality of a 2-ruling set (in a ring) is at least n/3.

If our original input is a directed path of n vertices, rather than a ring, we
obtain a 2-ruling set by applying the basic step log* n times, as above. To
obtain a log^(k) n-ruling set we apply the basic step k times.

General  remarks. 

1. Readers familiar with randomized algorithms may be tempted to
solve these problems using randomization. We already mentioned that
[Vi-84b] did so for (the related) list ranking problem. Our
deterministic technique was inspired by such a randomized approach.

2. The ⌈log n⌉-ruling set algorithm is valid even for models of
distributed computation that allow only local communication and do
not have a shared memory like a PRAM. We do not elaborate on this.

3.   Balanced tree algorithms

3.1.  Preliminaries 

Theorem (Brent). Any synchronous parallel algorithm taking time t that
consists of a total of x elementary operations can be implemented by p
processors within a time of $\lceil x/p \rceil + t$.
Proof of Brent's theorem. Let $x_i$ denote the number of operations performed
by the algorithm in time i ($\sum_{i=1}^{t} x_i = x$). We use the p processors to
"simulate" the algorithm. Since all the operations at time i can be executed
simultaneously, they can be computed by the p processors in $\lceil x_i/p \rceil$
units of time. Thus, the whole algorithm can be implemented by p processors
in time

$$\sum_{i=1}^{t} \lceil x_i/p \rceil \le \sum_{i=1}^{t} (x_i/p + 1) \le \lceil x/p \rceil + t. \qquad \Box$$

Remark. The proof of Brent's theorem poses two implementation problems.
The first is to evaluate x_i at the beginning of time i in the algorithm. The
second is to assign the processors to their jobs.
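
The counting argument in the proof can be checked numerically. The sketch below is ours: it takes a hypothetical list ops_per_step holding the number of operations at each time unit, computes the simulated running time with p processors, and compares it with the bound of the theorem. It ignores the two implementation problems mentioned in the remark.

```python
def simulated_time(ops_per_step, p):
    """Time for p processors to simulate the algorithm step by step:
    the sum over all steps of ceil(x_i / p)."""
    return sum(-(-x // p) for x in ops_per_step)


def brent_bound(ops_per_step, p):
    """The bound ceil(x / p) + t, with x the total work and t the number of steps."""
    x = sum(ops_per_step)
    t = len(ops_per_step)
    return -(-x // p) + t
```

For a balanced binary tree summation over 16 leaves the work per level is [8, 4, 2, 1]; with p = 4 the simulated time is 5, within the bound of 8.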

Recall the following standard deterministic parallel algorithm for the
list-ranking problem (defined in the Introduction). Say that we have n
processors. Assign a processor to each of the n elements. Denote the pointer
of element i of the input array by D(i) and initialize R(i) := 1, 1 ≤ i ≤ n. We
set D(t) := "end of list" (where t is the last element in the linked list),
D("end of list") := "end of list" and R("end of list") := 0.

Iterate ⌈log n⌉ times:

    for processor i, 1 ≤ i ≤ n, pardo (perform in parallel)
        R(i) := R(i) + R(D(i)); D(i) := D(D(i))

(To be called the short-cut operation, performed by i at D(i).) (See Fig. 2.)

Note that Ω(n log n) short-cuts are made by this algorithm. It runs in time
O((n log n)/p + log n) using p processors on an EREW PRAM and solves
the list ranking problem, by placing the results in the vector R.

Implementation Remark 1. In order to derive this running time from Brent's
theorem n has to be broadcast to all p processors. This takes an additional
O(log p) time.

Implementation Remark 2. As presented the algorithm is not EREW since
there are concurrent reads at "end of list". This can be avoided by instructing
every processor i to quit when D(i) = "end of list".
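
The standard algorithm can be simulated sequentially as follows. This is a sketch under our own conventions, not an EREW implementation: elements are numbered from 0, succ[i] is the array index of the successor of i (None for the last element), index n plays the role of "end of list", and each synchronous round is modelled by building new copies of R and D. With the initialization used above, the value returned for element i is its distance to "end of list", so the last element gets 1.

```python
def pointer_jumping_rank(succ):
    """Rank every element of a linked list by repeated short-cuts."""
    n = len(succ)
    end = n                                   # sentinel index for "end of list"
    D = [end if s is None else s for s in succ] + [end]
    R = [1] * n + [0]                         # R("end of list") = 0
    rounds = (n - 1).bit_length() if n > 0 else 0   # ceil(log2 n) rounds
    for _ in range(rounds):
        # All processors read the old R and D, then write simultaneously.
        new_R = [R[i] + R[D[i]] for i in range(n + 1)]
        new_D = [D[D[i]] for i in range(n + 1)]
        R, D = new_R, new_D
    return R[:n]
```

For the list 2 → 0 → 3 → 1, encoded as succ = [3, None, 0, 1], the function returns [3, 1, 4, 2].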

3.2.  Balanced  binary  tree  parallel  algorithms. 

One  simple  pattern  of  optimal  speed-up  deterministic  parallel  algorithms 
uses  the  balanced  binary  tree.  This  pattern  was  used,  among  many  others,  by 
[W-79],  [CLC-81]  and  [Vi-84a].    Let  us  first  demonstrate  this  pattern  on  the 
problems of computing sums and prefix sums.

Input: An array of n numbers A(1), A(2), ..., A(n). Assume, without loss of
generality, that log_2 n is an integer.

Problem: Compute their sum.

Algorithm: "Plant" a balanced binary tree with n leaves on the array. The
nodes of the tree at level h are denoted [h,j], 1 ≤ j ≤ 2^(log n - h). See Fig. 3.
Leaf [0,j] corresponds to A(j). Associate a number B[h,j] with node [h,j] of
the tree.

Initialization: for all 1 ≤ j ≤ n pardo B[0,j] := A(j).

for h := 1 to log n do
    for all 1 ≤ j ≤ 2^(log n - h) pardo B[h,j] := B[h-1,2j-1] + B[h-1,2j].

B[log n,1] holds the desired sum.

Think first about an n processor implementation of this summation
algorithm. It runs in O(log n) time. Then apply the proof of Brent's Theorem
to get an alternate implementation that uses only n/log n processors and runs
in O(log n) time. This summation algorithm can be extended to solve the
following prefix sum problem.

Input:  Same  as  for  the  summation  problem. 

Problem: Compute $\sum_{j=1}^{i} A(j)$ for all 1 ≤ i ≤ n.

Algorithm:  Perform  the  summation  algorithm  given  above,  thereby  obtaining 
all  the  B  values.  An  additional  "down-sweep"  of  the  tree  (from  the  root  to 
the  leaves),  which  roughly  amounts  to  reversing  the  operation  of  the 
summation  algorithm,  will  complete  the  job. 

Associate another number C[h,j] with each node [h,j].

Initialization: C[log n,1] := 0.

for h := log n - 1 downto 0 do
    for all 1 ≤ j ≤ 2^(log n - h) pardo
        if j is odd
        then C[h,j] := C[h+1,(j+1)/2]
        else C[h,j] := C[h+1,j/2] + B[h,j-1].
for all 1 ≤ j ≤ n pardo C[0,j] := C[0,j] + B[0,j].

C[0,j], 1 ≤ j ≤ n, hold the desired prefix sums. This algorithm can also be
implemented to run in O(n/p + log n) time using p processors on an EREW
PRAM. (Apply Brent's theorem and Implementation Remark 1.)
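
The up-sweep and down-sweep can be traced with a sequential Python sketch. It is ours and makes two assumptions: n is a power of two, as in the text, and indices are 0-based, so the paper's "j is odd" corresponds to an even index here. Each inner loop stands for one pardo statement.

```python
def tree_prefix_sums(A):
    """Return (prefix sums, total) using the balanced binary tree scheme."""
    n = len(A)
    levels = n.bit_length() - 1
    if n < 1 or (1 << levels) != n:
        raise ValueError("this sketch assumes n is a power of two")

    # Up-sweep: B[h][j] is the sum of the leaves below node [h, j].
    B = [list(A)] + [[0] * (n >> h) for h in range(1, levels + 1)]
    for h in range(1, levels + 1):
        for j in range(n >> h):
            B[h][j] = B[h - 1][2 * j] + B[h - 1][2 * j + 1]

    # Down-sweep: C[h][j] is the sum of all leaves to the left of node [h, j].
    C = [[0] * (n >> h) for h in range(levels + 1)]
    for h in range(levels - 1, -1, -1):
        for j in range(n >> h):
            if j % 2 == 0:                     # left child: same as its parent
                C[h][j] = C[h + 1][j // 2]
            else:                              # right child: add the left sibling
                C[h][j] = C[h + 1][j // 2] + B[h][j - 1]

    prefix = [C[0][j] + B[0][j] for j in range(n)]
    return prefix, B[levels][0]
```

For A = [3, 1, 4, 1, 5, 9, 2, 6] the result is ([3, 4, 8, 9, 14, 23, 25, 31], 31).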

A wishful thought. We want to find an algorithm for the list ranking problem
that performs a total of O(n) short-cuts. If we could "plant" a balanced
binary tree in our linked list (in the order of the linked list) it would solve
our problem: enter a one at each leaf and apply the prefix sum algorithm. A
closer look at the summation part of such a prefix sum computation reveals
the following:

The operation of the for statement (of the summation algorithm) for
h = 1 corresponds to short-cuts at every odd location in the linked list.
This results in a new linked list that connects only the even locations
of the original list, thereby halving its length. Then, the for statement
for h = 2 corresponds to short-cuts at odd locations of the new linked
list, and so on. See Fig. 4. Observe that the for statement of the
summation algorithm never performs a short-cut at two successive
elements of the linked list at hand; and, therefore, the "input" to any
operation of this for statement is a single linked list.

Remark:  The  problem,  of  course,  is  that  we  do  not  know  how  to  plant  a 
balanced  binary  tree  with  respect  to  the  linked  list  without  actually  first 
solving  the  list  ranking  problem  itself,  since  this  "planting"  needs  the  ranking 
mod  2,  mod  4,  mod  8,...  as  explained  above. 

Each  operation  of  the  for  statement  has  the  following  two  features. 

(1)  The  output  is  a  single  list  whose  length  is  half  the  length  of  the 
input. 

(2) It takes O(1) parallel time to execute.

We will use an algorithm which approximates these two features. In our new
algorithm  we  plant  an  "approximately  balanced  tree"  (it  will  be  a  2-3  tree). 
Each  leaf  of  the  tree  corresponds  to  an  element  of  the  list,  and  each  level  of 
the  tree  corresponds  to  an  iteration  of  the  for  statement.  For  a  given  level  of 
the  tree,  the  nodes  at  this  level  correspond  to  those  elements  of  the  list  over 
which  shortcuts  have  not  yet  been  made  (by  iterations  of  the  for  statement 
corresponding  to  lower  levels  of  the  tree).  For  each  level  of  the  tree  we 
divide  the  elements  of  the  list  (corresponding  to  nodes  at  this  level)  into  two 
sets:  those  that  are  shortcut  (by  the  corresponding  iteration  of  the  for 
statement),  called  victims,  and  those  that  are  not  shortcut  (called  survivors). 
In  order  to  approximately  achieve  properties  (1)  and  (2)  above,  we  require 
these  two  sets  to  meet  the  following  two  constraints: 

(a)  If  an  element  is  a  survivor  then  its  successor  (if  any)  is  a  victim. 

(b)  One,  at  least,  of  every  three  adjacent  elements  is  a  survivor. 

By (a) at most one half of the elements are survivors. By (b) each survivor
need perform at most two shortcut operations to remove all the victims from
the list. Hence in O(1) parallel time (using n processors) we obtain a single
linked list containing at most half as many elements (assuming we can
separate the elements into survivors and victims).
But  a  2-ruling  set  provides  an  appropriate  set  of  survivors! 

4.   The basic list ranking algorithm

Initialization: m := n. As in the standard deterministic algorithm, denote
the pointer of element i by D(i) and initialize R(i) := 1, 0 ≤ i ≤ n-1.

The algorithm which is given later should be read together with the
commentary below. The purpose of the while loop of the algorithm is to
"thin out" the input linked list into a list of length ≤ n/log n. The input to
each iteration of the while loop is a linked list of length m stored in an array
of length m. Vector D contains, for each element, the next element in this
linked list.

The purpose of Step 2 is to enter either the value 1 or the value 0 into
RULING(j), for each j, 0 ≤ j ≤ m-1, so that those elements with
RULING(j) = 1, 0 ≤ j ≤ m-1, form a 2-ruling set of the directed graph. Step
2 uses essentially the algorithm of Section 2 for finding a 2-ruling set.
In Step 3 we shortcut, in parallel, over each j such that RULING(j) = 0. The
resulting list will contain exactly those elements in the 2-ruling set, of which
there are at most m/2. We make some further comments on the operation of
this step.

(a) Each element j for which RULING(j) = 1 (an element of the 2-ruling
set) is followed by at least one and at most two elements for which
RULING is 0.

(b) Each element over which we perform a shortcut will remain with no
incoming pointers. Such elements will be "deleted" in Step 4.

(c) The parameter t stands for the present time. The information in
OP(i,t) enables us, later on, to reconstruct the operation of processor i at
time t. This is used in Step 6 to derive the final value of R(D(j)) by
subtracting the present value of R(j) from the final value of R(j). For
this reason we preferred here to name the processors performing the
operations rather than to use the framework of Brent's theorem.

Step 4 contracts the input array for the present while loop iteration into a
new array that contains exactly those elements in the new linked list.

When we arrive at Step 5, the length of the linked list at hand is ≤ n/log n.
Step 5 applies the standard list ranking algorithm in order to find the ranking
of each element in this linked list.

Step 6 extends the list rankings to all elements of the original linked list using
the information in OP(.,.).

while m > n/log n do
Step 1. (Initialization for the present while loop iteration).
    for j, 0 ≤ j ≤ m-1, pardo
        SERIAL_0(j) := j

Step 2. Compute a 2-ruling set into vector RULING.

From now on we specify for each instruction the processors that perform
it. Suppose p processors are available. Processor i, 1 ≤ i ≤ p, is assigned
to segment [(i-1)m/p, ..., im/p - 1] of the array that forms the input to
this while loop iteration. (For simplicity we assume that m/p is an integer.
Otherwise, we could assign Processor i to the segment including all the
integers in the half open interval ((i-1)m/p - 1, im/p - 1].)
Step 3.

for Processor i, 1 ≤ i ≤ p, pardo
    for j := (i-1)m/p to im/p - 1 do
        if RULING(j) = 1
        then OP(i,t) := (D(j), j, R(j));
             R(j) := R(j) + R(D(j)); D(j) := D(D(j)) (shortcut).
             if RULING(D(j)) = 0
             then OP(i,t) := (D(j), j, R(j));
                  R(j) := R(j) + R(D(j)); D(j) := D(D(j)) (shortcut).
Step 4. Perform the balanced binary tree prefix-sum computation described
in the previous section with respect to the vector RULING. As a result:

(1) m := Σ_j RULING(j), and

(2) each element j with RULING(j) = 1 gets its entry number in a
    (contracted) array of length m containing the output linked list.

(This array is the input for the next iteration (if any) of the while loop.)
od

Let T_1 (resp. T_2) be the first (resp. last) time unit for which an assignment
into OP(.,.) was performed.

Step 5. Apply a simulation of the standard deterministic algorithm by p
processors to the current array.

Step 6.

for Processor i, 1 ≤ i ≤ p, pardo
    for t := T_2 downto T_1 do
        R(OP(i,t).1) := R(OP(i,t).2) - OP(i,t).3.
(Comment. OP(i,t).k, k = 1, 2, 3, represent the fields of
OP(i,t). Also, recall Comment (c) in the verbal description
of Step 3.)

Implementation  remark.  Each  time  m  gets  a  new  value,  broadcast  it  to  all 
processors  as  in  Implementation  Remark  1  of  the  previous  section. 
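
The up-sweep (Steps 1 to 4), the ranking of the short list (Step 5) and the down-sweep driven by OP (Step 6) can be illustrated by the following sequential sketch. It is not the parallel algorithm itself. It assumes that the last element of the list points to itself with R = 0, it replaces the 2-ruling set computation of Step 2 by a simple sequential walk that keeps every other element, and it ranks the contracted list by a direct walk instead of the standard parallel algorithm. The names rank_by_contraction, op_log and limit are introduced here for illustration only.

```python
import math

def rank_by_contraction(succ, head):
    """Return rank[v] = number of edges from v to the end of the list.

    Assumption of this sketch: succ[v] is the successor of v and the
    last element points to itself. The list must be non-empty.
    """
    n = len(succ)
    D = list(succ)                                  # current pointers
    R = [0 if D[v] == v else 1 for v in range(n)]   # length of edge v -> D[v]
    op_log = []                                     # plays the role of OP(.,.)
    limit = max(2.0, n / max(1.0, math.log2(n)))    # stop near n / log n
    alive = n
    while alive > limit:
        # Stand-in for Step 2: keep every other element, plus the tail.
        ruling, v, pick = set(), head, True
        while True:
            if pick or D[v] == v:
                ruling.add(v)
            if D[v] == v:
                break
            v, pick = D[v], not pick
        # Step 3: ruling elements shortcut over non-ruling successors.
        for j in ruling:
            while D[j] not in ruling:
                op_log.append((D[j], j, R[j]))      # (skipped, j, old R(j))
                R[j] += R[D[j]]
                D[j] = D[D[j]]
        alive = len(ruling)                         # Step 4 (contraction)
    # Stand-in for Step 5: rank the short list by walking it.
    chain, v = [head], head
    while D[v] != v:
        v = D[v]
        chain.append(v)
    rank = [None] * n
    rank[chain[-1]] = 0
    for v in reversed(chain[:-1]):
        rank[v] = R[v] + rank[D[v]]
    # Step 6: replay the log backwards, as in R(OP.1) := R(OP.2) - OP.3.
    for skipped, j, old_r in reversed(op_log):
        rank[skipped] = rank[j] - old_r
    return rank
```

Replaying the log from the last entry to the first guarantees that the rank of j is already known whenever an entry (skipped, j, old_r) is processed, which is the reason Step 6 runs t from T_2 down to T_1.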

Complexity 

Initialization requires O(n) operations and O(1) time. Let us focus on one
iteration of the while loop.
Step 1 takes O(m) operations and O(1) time.
Step 2 takes O(m log* m) operations and O(log* m) time.
Step 3 takes O(m) operations and O(1) time.
Step 4 takes O(m) operations and O(log m) time.
So each iteration of the while loop takes O(m log* m) operations and O(log m)
time. Each such iteration results in a linked list whose length is at most 1/2 the
length of the list when the iteration started. Therefore, after O(log log n)
iterations we get a list whose length is at most n/log n. Summing up the operation
and time complexity of the while loop gives O(n log* n) operations and
O(log n log log n) time.
Step 5 takes O(n) operations and O(log n) time.
Step 6 requires the same number of operations and time as all the iterations
of Step 3, since it follows its "footsteps".

So we get a total of O(n log* n) operations and O(log n log log n) time.
Applying Brent's theorem we get O((n log* n)/p) time using any number
p ≤ (n log* n)/(log n log log n) of processors. We know that any such result
can be alternatively stated as O(log n log log n) time using
(n log* n)/(log n log log n) processors. We leave the reader to verify that the
implementation problems as per the remark following Brent's theorem can be
readily overcome.
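
The count of O(log log n) iterations can be checked numerically. The sketch below assumes base-2 logarithms and the worst case in which each iteration exactly halves the list; the convention used for log* (apply log until the value is at most 2) is also an assumption of this sketch. The function names are introduced here for illustration only.

```python
import math

def log_star(n):
    """Number of times log2 is applied to n before the value is <= 2."""
    count, x = 0, float(n)
    while x > 2:
        x = math.log2(x)
        count += 1
    return count

def halvings_needed(n):
    """While-loop iterations needed to go from length n to length <= n / log2(n),
    assuming every iteration exactly halves the list."""
    target = n / math.log2(n)
    m, rounds = float(n), 0
    while m > target:
        m /= 2
        rounds += 1
    return rounds
```

For n = 2^16 the loop needs 4 = log log n halvings, while log* n is only 3 under this convention, which illustrates why the log log n factor in the running time comes from the number of iterations rather than from Step 2.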

5.   The  Optimal  Algorithm 

We describe an algorithm that runs in time
O(nk/p + log^(k) n log n + k log n) using any number p of processors, for
k ≤ log* n (henceforth, we assume k ≤ log* n). We deduce the following
results.

(1) For fixed k, with p = n/(log n log^(k) n) we achieve a running time
of O(k log n log^(k) n) = O(log n log^(k) n); this is optimal speed-up.

(2) For k = log* n, with p = n/log n, we achieve a running time of
O(log n log* n), since for k = log* n, log^(k) n ≤ 2.

A  variant  of  the  algorithm  will  yield  our  third  result. 
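
As a worked check, results (1) and (2) follow by substituting the stated values of p into the general bound. The symbol T(p,k) for the running time is introduced here for illustration only, and the simplification in (1) assumes log^(k) n ≥ 1.

```latex
\begin{align*}
T(p,k) &= O\!\left(\frac{nk}{p} + \log^{(k)} n \,\log n + k \log n\right). \\
\text{(1)}\quad p = \frac{n}{\log n \,\log^{(k)} n}:\quad
  \frac{nk}{p} &= k \log n \,\log^{(k)} n,
  \quad\text{so } T = O\!\left(k \log n \,\log^{(k)} n\right). \\
\text{(2)}\quad k = \log^{*} n,\ p = \frac{n}{\log n}:\quad
  \frac{nk}{p} &= \log n \,\log^{*} n,\quad
  \log^{(k)} n \,\log n \le 2 \log n,
  \quad\text{so } T = O\!\left(\log n \,\log^{*} n\right).
\end{align*}
```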

The basic algorithm (of the previous section) had two stages. In the first
stage (the while loop) we employed an almost optimal algorithm (given a list
of length m it performed O(m log* m) operations). In the second stage (Step
5) we used an algorithm that performed relatively more operations (for a list
of length m, O(m log m) operations), but it had the advantage of being faster.
To profit from this we needed to ensure that the numbers of operations
performed by the two stages were roughly the same. And, in fact, this was
the case, because the list processed in the second stage was sufficiently
shorter. Our present algorithm pushes this methodology further. The
algorithm has three main stages, each one processing a relatively shorter list.
Stage 1 uses an optimal algorithm that is relatively slow; its effect is to
slightly reduce the length of the input list. Stage 2 uses an almost optimal
algorithm; it is faster. Its effect is to further reduce the length of the list.
Stage 3 uses the standard deterministic algorithm. It misses optimality by a
logarithmic factor, but it is the fastest of the three algorithms. The overall
result is a fast optimal algorithm. This methodology was also used in
[Vi-83b].

The input for Stage 1 is the input linked list of length n. The output of
Stage 1 (and input for Stage 2) is a linked list of length ≤ n/(log^(k+1) n)^3. The
output of Stage 2 (input for Stage 3) is a linked list of length ≤ n/(log n)^3.
Each of the linked lists mentioned above is given in an array whose size is the
same as the length of the list. Stage 3 simply consists of applying the standard
deterministic parallel algorithm.

Remarks.  The  algorithm  will  be  described  in  less  detail  than  the  preceding 
algorithms.  In  particular: 

1. At each time step of Stages 1 and 2 we have a linked list that was obtained
from the input list by propagating pointers over vertices that were omitted
(as in the previous section). In particular, every edge, in any of the linked
lists that are obtained throughout these stages, corresponds to a directed path
in the original input list. We must maintain a vector (like R in the previous
section) that holds, for each such edge, the length of its original path.
However, in this presentation we focus only on the transitions from a given
linked list to a shorter one and avoid mentioning updates of this vector.

2. Note that (in Stages 1 and 2) we only mention contractions of a linked
list into a shorter one (the up-sweep part, in the terminology of Section 3). We
will systematically omit the corresponding down-sweep part throughout this
section. No new ideas (beyond Section 4) are required in order to fill in this
part.

Stage  1.  This  stage  employs  Procedure  1  repeatedly. 

Procedure  1. 

Input: A linked list of length m given in an array of length m.

Output: A linked list of length ≤ m/2 given in an array of the same length as
the list.

Procedure 1 proceeds as follows.

1. Apply the basic step (of Section 2) k+1 times to obtain a log^(k+1) m-ruling
set. (Denote the cardinality of this ruling set by m_1.)

Explanation. The output list of Procedure 1 will consist of the vertices of
the ruling set. So for each vertex v in the ruling set the remaining job for
Procedure 1 is to traverse the input list up to the first successor that is also in
the ruling set (to be called the sublist of v). (Recall that the edge length of
each such sublist is between 2 and log^(k+1) m + 1.) This remaining job might
cause difficulties if we used a naive assignment of processors to their jobs as
per the remark following Brent's theorem (particularly if optimal speed-up is
desired). Below, we show how to overcome these difficulties.

2. Using the prefix sum algorithm assign numbers from 1 to m_1 to the
vertices of the ruling set. (Each vertex represents its sublist, which is
thereby implicitly numbered.)

Conceptually, in steps 3 and 4 below, we partition the work of traversing
the m_1 sublists among m/(log m log^(k+1) m) processors. We have two phases.

3. Phase 1 consists of time pulses. At each time pulse each processor is "in
charge" of log m sublists. For each of its sublists the processor advances
down one edge; if an element of the ruling set is not encountered (meaning
that the traversal of the sublist is not yet completed) then the processor
remains in charge of the sublist. At the end of each pulse each processor
needs to acquire a few new sublists in order to restore its number of sublists
to log m. The situation is as follows:

(1) Each sublist, up to sublist q, for some q ≤ m_1, has been previously
assigned to a processor.

(2) Processor i, 0 ≤ i < m/(log m log^(k+1) m), needs a_i additional sublists
to restore its number of sublists to log m.

In parallel, the processors perform a prefix sum computation on the a_i (in time
O(log m)). Let b_i = Σ_{j=0}^{i} a_j, for 0 ≤ i < m/(log m log^(k+1) m). Processor i
acquires a_i new sublists by taking the sublists q + b_{i-1} + 1 through q + b_i.
(This makes sense only as long as we refer to indices ≤ m_1.) If
Σ_{j=0}^{m/(log m log^(k+1) m)} a_j ≥ m_1 then all the sublists have been assigned to
processors and we proceed to Phase 2. Otherwise, another pulse of Phase 1 is
performed. (This application of a prefix sum computation is very similar to
the (known) use of the primitive Fetch-and-Add by the NYU-Ultracomputer
for the parallel implementation of a queue (see [GLR-83]).)

4. Phase 2. The situation is that each of the m/(log m log^(k+1) m) processors is
in charge of at most log m sublists, where the length of each sublist is
≤ log^(k+1) m. Each processor simply completes the traversal for all its
sublists.

5. A prefix sum computation is applied in order to contract the input array
into an array of size m_1 containing only the vertices of the output linked list.
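
The way processors acquire new sublists at the end of a pulse of Phase 1 (step 3) can be sketched as follows. This is a sequential illustration only: itertools.accumulate stands in for the parallel prefix sum, sublists are numbered 1 to m1 as in step 2, and the function name and return values are hypothetical.

```python
from itertools import accumulate

def assign_new_sublists(q, needs, m1):
    """One allocation round at the end of a pulse.

    q      -- sublists 1..q have already been assigned
    needs  -- needs[i] = a_i, the number of sublists processor i wants
    m1     -- total number of sublists
    Returns (assignments, new_q): assignments[i] lists the sublist
    numbers handed to processor i, new_q is the updated value of q.
    """
    b = list(accumulate(needs))              # b[i] = a_0 + ... + a_i
    assignments = []
    for i in range(len(needs)):
        first = q + (b[i - 1] if i > 0 else 0) + 1
        last = min(q + b[i], m1)             # indices beyond m1 do not exist
        assignments.append(list(range(first, last + 1)))
    new_q = min(q + (b[-1] if b else 0), m1)
    return assignments, new_q
```

Because each processor derives its range from b_{i-1} and b_i alone, no two processors ever claim the same sublist, which is what makes the scheme usable on an EREW PRAM without a shared queue. Once new_q reaches m1 every sublist has an owner.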

Time complexity of Stage 1.

Complexity of an iteration of Procedure 1.
Step 1: O(k) time, O(mk) operations.
Each of Steps 2 and 5: O(log m) time, O(m) operations.
Step 3: Recall that we use m/(log m log^(k+1) m) processors. In each pulse each
processor needs O(log m) time to traverse one edge in each of its sublists. An
additional O(log m) time is needed for the prefix sum computation. Since
each pulse provides for traversals of m/log^(k+1) m edges there will be at most
log^(k+1) m pulses before the queue is empty and we proceed to Step 4. So
Step 3 takes O(log m log^(k+1) m) time and O(m) operations.
Step 4: O(log m log^(k+1) m) time using O(m) operations.

So an iteration of Procedure 1 takes O(log m log^(k+1) m) time and O(mk)
operations.

Complexity of Stage 1. The output of Stage 1 is a linked list whose length is
≤ n/(log^(k+1) n)^3. Since at each invocation of Procedure 1, m_1 ≤ m/2, we
need to use at most 3 log^(k+2) n iterations of Procedure 1. So the total number
of operations in Stage 1 is O(nk + nk/2 + nk/4 + ...) = O(nk) and the total
time is O(log n log^(k+1) n log^(k+2) n) ≤ O(log n log^(k) n).
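
The iteration count and the time bound for Stage 1 can be spelled out as follows. The derivation assumes base-2 logarithms, and m_r (the list length after r invocations of Procedure 1) is a symbol introduced here for illustration only.

```latex
\begin{align*}
m_r &\le \frac{n}{2^{r}} \quad\text{after } r \text{ invocations of Procedure 1},\\
2^{3\log^{(k+2)} n} &= \left(2^{\log^{(k+2)} n}\right)^{3} = \left(\log^{(k+1)} n\right)^{3}
  \quad\Longrightarrow\quad
  m_r \le \frac{n}{\left(\log^{(k+1)} n\right)^{3}} \text{ for } r = 3\log^{(k+2)} n,\\
\text{time} &= O\!\left(\log^{(k+2)} n\right)\cdot O\!\left(\log n\,\log^{(k+1)} n\right)
  = O\!\left(\log n\,\log^{(k+1)} n\,\log^{(k+2)} n\right).
\end{align*}
```

The final comparison with O(log n log^(k) n) uses the fact that, writing x = log^(k) n, the product log x · log log x is at most x for all sufficiently large x.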

Stage 2. Stage 2 consists of k iterations of Procedure 2.

Iteration i of Procedure 2, (1 ≤ i ≤ k).

Let j = k+1-i.

Input. A linked list of length at most n/(log^(j+1) n)^3, given in an array having
the same length as the list.

Output. A linked list of length at most n/(log^(j) n)^3, given in an array having
the same length as the list.

1. Apply 3 log^(j+1) n - 3 log^(j+2) n iterations of Routine 1.

Iteration g of Routine 1, 0 ≤ g < 3 log^(j+1) n - 3 log^(j+2) n.

Input. A linked list of length m ≤ 2^(-g) n/(log^(j+1) n)^3, given in an array of
length ≤ n/(log^(j+1) n)^3. (The vertices of the linked list are "spread over" the
array, which may have more entries than the length of the list. Redundant
entries of the array (i.e., entries that represent vertices which are not in the
input list for iteration g) are marked as such. The reason for this "wasteful"
representation of the input is that iterations of Routine 1 "save time" by not
contracting their input array to include only their output list. Only the end of
Procedure 2 contracts the linked list at hand.)

Output. A linked list of length m_1 ≤ m/2, given in an array of length
≤ n/(log^(j+1) n)^3.

(a) Apply the basic step (of Section 2) j+1 times to obtain a log^(j+1) n-ruling
set. (Denote the cardinality of this ruling set by m_1.)

Explanation. The output list of the present iteration of Routine 1 will
consist of the vertices of the ruling set. So for each vertex v in the ruling set
the remaining job is to traverse the sublist of v. (Recall that the edge length
of each such sublist is between 2 and log^(j+1) n + 1.)

(b) for 1 ≤ v ≤ n/(log^(j+1) n)^3 pardo
        if v ∈ the ruling set
        then traverse the sublist of v.

This completes iteration g of Routine 1.
Step 2 below concludes the present iteration of Procedure 2.

2. A prefix sum computation is applied in order to contract the input array
into an array containing only the vertices of the linked list at hand.
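
Part (b) of Routine 1, together with the "wasteful" array representation, can be sketched as follows. The sketch is sequential (the loop over v stands in for the pardo), it assumes the last vertex points to itself and always belongs to the ruling set, and the names routine1_traverse and in_list are hypothetical. D, R and in_list are updated in place and the array is deliberately not contracted.

```python
def routine1_traverse(D, R, in_list, ruling):
    """Each ruling vertex walks to the next ruling vertex.

    D[v]       -- successor of v (the tail points to itself)
    R[v]       -- length of the edge v -> D[v] in the original list
    in_list[v] -- False marks a redundant array entry
    ruling     -- set of ruling vertices; must contain the tail
    """
    for v in range(len(D)):                  # "pardo" in the paper
        if not in_list[v] or v not in ruling:
            continue
        w = D[v]
        while w not in ruling:
            R[v] += R[w]                     # absorb the edge w -> D[w]
            in_list[w] = False               # w becomes a redundant entry
            w = D[w]
        D[v] = w                             # point at the next ruling vertex
```

Only ruling vertices write to their own D and R entries, and they only read entries of non-ruling vertices, so the result does not depend on the order in which the vertices are processed.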

Time complexity of Stage 2.

Complexity of iteration g of Routine 1: Using n/(log^(j+1) n)^3 processors,
Step (a) takes O(j) time and Step (b) takes O(log^(j+1) n) time. This yields a
bound of O(nj/(log^(j+1) n)^2) operations taking O(log^(j+1) n + j) time.
Complexity of iteration i of Procedure 2: Step 1 consists of O(log^(j+1) n)
invocations of Routine 1. Step 2 needs O(n/(log^(j+1) n)^3) operations and
O(log n) time. Thus the i-th iteration of Procedure 2 performs
O(jn/log^(j+1) n) operations in time
O((log^(j+1) n + j) log^(j+1) n + log n) = O(log n).

So, overall, Stage 2 performs O(Σ_{i=1}^{k} (k+1-i)n/log^(k+2-i) n) ≤ O(kn)
operations in time O(k log n).
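
The per-iteration bounds for Procedure 2 are products of the number of Routine 1 invocations and the cost of one invocation, plus the cost of the final contraction. The following worked step is a sketch that assumes log^(j+1) n ≥ 1, j ≥ 1 and j ≤ k ≤ log* n.

```latex
\begin{align*}
\text{operations} &= O\!\left(\log^{(j+1)} n\right)\cdot
   O\!\left(\frac{nj}{\left(\log^{(j+1)} n\right)^{2}}\right)
   + O\!\left(\frac{n}{\left(\log^{(j+1)} n\right)^{3}}\right)
   = O\!\left(\frac{jn}{\log^{(j+1)} n}\right),\\
\text{time} &= O\!\left(\log^{(j+1)} n\right)\cdot
   O\!\left(\log^{(j+1)} n + j\right) + O(\log n)
   = O(\log n).
\end{align*}
```

The last equality holds because log^(j+1) n ≤ log log n for j ≥ 1, so both (log log n)^2 and log* n · log log n are O(log n).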

Stage 3 requires O(log n) time and O(n/log^2 n) operations. It is also easy
to bound the time and number of operations required by the down-sweep
part (which is missing in the above description) by the same time and number
of operations as for Stages 1 and 2.

Putting everything together, and applying Brent's theorem, we deduce that
the algorithm runs in time O(nk/p + log^(k) n log n + k log n), using any
number p of processors, where k ≤ log* n. The implementation problems as
per the remark following Brent's theorem can be readily overcome.

Remark: We can also obtain a class of algorithms taking time O(log n) on
n log^(k) n/log n processors, for any fixed positive integer k, as follows. By way
of motivation, we observe that, in the algorithm just described, Stage 2 is
faster than Stage 1 (on equal length inputs), but requires more operations.
Therefore, by substituting Stage 2 for Stage 1, we might expect to reduce the
running time and increase the total number of operations. So, in the above
algorithm, we replace Stage 1 with Routine 1 applied 3 log^(k+2) n times, where
the input for the g-th iteration is a linked list of length ≤ 2^(-g) n, stored in an
array of length n; also, in part (a) of the routine, we seek a log^(k+1) n-ruling
set. Then we perform the rest of the above algorithm with no change. We
achieve a running time of O(k log n) taking
O(n log^(k+1) n log^(k+2) n + kn) ≤ O(n log^(k) n) operations. Our result follows
by Brent's theorem. This shows that Wyllie's conjecture which was
mentioned in the introduction is not correct.
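
The application of Brent's theorem in this remark can be written out as a short worked step. W and T below denote the operation count and the unbounded-processor running time of the variant; both symbols are introduced here for illustration only.

```latex
\begin{align*}
W &= O\!\left(n \log^{(k)} n\right), \qquad T = O(k \log n),\\
\text{time on } p \text{ processors} &= O\!\left(\frac{W}{p} + T\right)
  = O\!\left(\frac{n \log^{(k)} n}{p} + k \log n\right),\\
p = \frac{n \log^{(k)} n}{\log n} \quad&\Longrightarrow\quad
  \text{time} = O(\log n + k \log n) = O(\log n) \ \text{ for fixed } k.
\end{align*}
```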

6.    Open  problems 

(1) Is there an optimal speed-up algorithm for the list ranking problem using
n/log n processors and running in time O(log n)?

(2)  We  recall  that  the  new  coin  tossing  technique  distinguishes  the  PRAM 
model  from  the  more  abstract  PRAM-INFINITY  model.  We  are  not  aware 
of  any  other  technique  having  this  property.  Are  there  others?  In  addition, 
this  remark  calls  for  a  "metatheoretical"  discussion  of  the  applicability  of 
PRAM-INFINITY lower bounds to PRAMs. We note that a lower bound in
the PRAM-INFINITY model is stronger than the same lower bound in the
decision tree model, a model that is often used when proving lower bounds.
Also, non-trivial lower bounds have been proved for the PRAM-INFINITY
model. Thus it seems useful to ascertain the applicability and limitations of
such lower bounds.